Align default argument values with evaluation repository #377
base: main
Conversation
Update args_parser.py and benchmark-specific run_infer.py files to use default values that match the evaluation repository (OpenHands/evaluation) eval-job/values.yaml configuration.

Shared defaults updated in args_parser.py:
- workspace: 'docker' -> 'remote'
- max-iterations: 100 -> 500
- critic: 'pass' -> 'finish_with_patch'

Benchmark-specific overrides using parser.set_defaults():
- gaia: dataset='gaia-benchmark/GAIA'
- swtbench: dataset='eth-sri/SWT-bench_Verified_bm25_27k_zsp'
- commit0: max_attempts=1, max_retries=1 (in addition to existing dataset)

Also updated AGENTS.md to document the default values alignment pattern.

Co-authored-by: openhands <openhands@all-hands.dev>
…hmultimodal
- swebench: Add explicit set_defaults(dataset, split) for consistency with other benchmarks, even though values match global defaults
- swebenchmultimodal: Update comment to match the pattern used in other benchmarks

Co-authored-by: openhands <openhands@all-hands.dev>
Each benchmark now sets its own dataset default via set_defaults(), so no global default is needed. Co-authored-by: openhands <openhands@all-hands.dev>
All benchmarks in the evaluation repository use .llm_config/runtime.json as the LLM config path, so use this as the default. Co-authored-by: openhands <openhands@all-hands.dev>
This reverts commit c34d730.
- Created config.py with INFER_DEFAULTS and EVAL_DEFAULTS for each benchmark
- Removed all defaults from utils/args_parser.py
- Removed default from critics.py
- Updated swebench, gaia, swtbench run_infer.py to use INFER_DEFAULTS
- Started commit0 update (import added)

Co-authored-by: openhands <openhands@all-hands.dev>
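To make the pattern concrete, here is a minimal sketch of what such a per-benchmark config.py could look like; the values follow the swebench defaults listed later in this PR, but the file contents are an illustration, not a copy of the repository's code.

```python
# Hypothetical benchmarks/swebench/config.py -- a sketch of the pattern, not the actual file.

# Inference-time defaults mirrored from the evaluation repository's values.yaml.
INFER_DEFAULTS = {
    "dataset": "princeton-nlp/SWE-bench_Verified",
    "split": "test",
    "num_workers": 30,
    "max_attempts": 3,
    "max_retries": 3,
}

# Evaluation-time defaults, kept separate because they can differ from the inference ones.
EVAL_DEFAULTS = {
    "dataset": "princeton-nlp/SWE-bench_Verified",
    "workers": 12,
}
```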
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
@OpenHands fix precommit issues

I'm on it! simonrosenberg can track my progress at all-hands.dev
Co-authored-by: openhands <openhands@all-hands.dev>
I've fixed the pre-commit issues and pushed the changes to the PR branch.

Summary of Changes

The pre-commit checks were failing due to ruff lint errors related to import ordering. The fixes were:
Verification
The PR should now pass the pre-commit CI checks.
Align EVAL_DEFAULTS with NUM_EVAL_WORKERS from evaluation repository values.yaml:
- GAIA: workers=1
- Commit0: workers=1

Co-authored-by: openhands <openhands@all-hands.dev>
Update eval_infer.py files to import and use EVAL_DEFAULTS from their respective config.py files via parser.set_defaults():
- swebench/eval_infer.py: uses EVAL_DEFAULTS for dataset, model_name, workers
- swtbench/eval_infer.py: uses EVAL_DEFAULTS for dataset, model_name, workers
- swebenchmultimodal/eval_infer.py: uses EVAL_DEFAULTS for dataset, split, model_name, workers

This ensures the default values defined in config.py are actually used by the evaluation scripts, aligning with the pattern used in run_infer.py files for INFER_DEFAULTS.

Co-authored-by: openhands <openhands@all-hands.dev>
…infer.py
Update run_infer.py files to import and use INFER_DEFAULTS from their respective config.py files via parser.set_defaults():
- commit0/run_infer.py: uses INFER_DEFAULTS for all inference settings
- swebenchmultimodal/run_infer.py: uses INFER_DEFAULTS for all inference settings

This ensures the default values defined in config.py are actually used by the inference scripts, completing the alignment with the evaluation repository values.yaml.

Co-authored-by: openhands <openhands@all-hands.dev>
Update eval_infer.py files to import and use EVAL_DEFAULTS from their respective config.py files via parser.set_defaults():
- commit0/eval_infer.py: uses EVAL_DEFAULTS for model_name
- gaia/eval_infer.py: uses EVAL_DEFAULTS for model_name

This ensures all benchmarks consistently use their config.py defaults.

Co-authored-by: openhands <openhands@all-hands.dev>
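As a rough, self-contained illustration of the wiring these commits describe (in the repository the dict would be imported from the benchmark's config.py; the values here are just examples):

```python
import argparse

# In the real layout this dict would come from the benchmark's config.py.
EVAL_DEFAULTS = {"dataset": "princeton-nlp/SWE-bench_Verified", "workers": 12}

parser = argparse.ArgumentParser()
parser.add_argument("--dataset")
parser.add_argument("--workers", type=int)

# Apply the benchmark-specific defaults after the arguments are declared;
# they only fill in values the user did not pass on the command line.
parser.set_defaults(**EVAL_DEFAULTS)

args = parser.parse_args([])  # no CLI flags -> values come from EVAL_DEFAULTS
print(args.dataset, args.workers)  # princeton-nlp/SWE-bench_Verified 12
```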
These fields are not benchmark-specific and should have global defaults:
- note: 'initial' (user-facing option for run identification)
- n_limit: 0 (no limit by default)
- output_dir: OUTPUT_DIR from constants.py ('./eval_outputs')
Added OUTPUT_DIR constant to benchmarks/utils/constants.py.
This keeps INFER_DEFAULTS focused on benchmark-specific values from
the evaluation repository's values.yaml.
Co-authored-by: openhands <openhands@all-hands.dev>
- gaia: Remove max_retries from INFER_DEFAULTS (not used in run_infer.py)
- gaia: Remove workers from EVAL_DEFAULTS (not used in eval_infer.py)
- commit0: Remove workers from EVAL_DEFAULTS (not used in eval_infer.py)

Each config now only contains fields that are actually used by the corresponding run_infer.py and eval_infer.py scripts.

Co-authored-by: openhands <openhands@all-hands.dev>
Remove the default value 'initial' from --note argument. When not specified, no note identifier is appended to the output directory. The construct_eval_output_dir function already handles None/empty values gracefully by not appending the _N_ suffix. Co-authored-by: openhands <openhands@all-hands.dev>
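A hedged sketch of the behavior this relies on; the real construct_eval_output_dir may build its path differently, so the signature and suffix format below are assumptions for illustration only.

```python
def construct_eval_output_dir(base_dir: str, note: str | None) -> str:
    # Assumed behavior: append a note-based suffix only when a note is provided,
    # so dropping the 'initial' default simply yields the bare directory name.
    if note:
        return f"{base_dir}_{note}"
    return base_dir


print(construct_eval_output_dir("./eval_outputs/run", None))       # ./eval_outputs/run
print(construct_eval_output_dir("./eval_outputs/run", "initial"))  # ./eval_outputs/run_initial
```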
Replace hardcoded dataset, split, and repo_split values with references to INFER_DEFAULTS in:
- commit0/run_infer.py: Commit0Evaluation class __init__ and prepare_instances
- commit0/build_images.py: set only the specific defaults needed (dataset, split, repo_split)

This ensures all commit0 code uses the centralized config values.

Co-authored-by: openhands <openhands@all-hands.dev>
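Roughly, the change described above amounts to referencing the shared dict instead of repeating literals; the class and parameter names below follow the commit message, while the dict values are placeholders rather than the real commit0 identifiers.

```python
# Assumed to live in commit0/config.py; the values are placeholders.
INFER_DEFAULTS = {
    "dataset": "example-org/commit0-dataset",  # placeholder, not the real dataset id
    "split": "test",        # placeholder
    "repo_split": "lite",   # placeholder
}


class Commit0Evaluation:
    def __init__(
        self,
        dataset: str = INFER_DEFAULTS["dataset"],
        split: str = INFER_DEFAULTS["split"],
        repo_split: str = INFER_DEFAULTS["repo_split"],
    ) -> None:
        # Constructor defaults come from the centralized config rather than literals.
        self.dataset = dataset
        self.split = split
        self.repo_split = repo_split
```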
98a6a58 to e53928f
The commit0 eval_infer.py is a simple JSON processor that doesn't need centralized defaults. Reverted to main version. Co-authored-by: openhands <openhands@all-hands.dev>
The gaia eval_infer.py is a simple JSON processor that doesn't need centralized defaults. Reverted to main version. Co-authored-by: openhands <openhands@all-hands.dev>
Import DEFAULT_DATASET, DEFAULT_CLI_MODEL_NAME, DEFAULT_EVAL_WORKERS from constants.py instead of duplicating the values. This ensures constants.py remains the single source of truth for these values. Co-authored-by: openhands <openhands@all-hands.dev>
… config.py
Remove these constants from constants.py and update eval_infer.py to use EVAL_DEFAULTS from config.py instead. config.py is now the single source of truth for dataset, model_name, and workers defaults. Co-authored-by: openhands <openhands@all-hands.dev>
…VAL_DEFAULTS
model_name is specific to the CLI and should stay in constants.py. EVAL_DEFAULTS now only contains dataset and workers. Co-authored-by: openhands <openhands@all-hands.dev>
Revert eval_infer.py files to main and remove model_name from EVAL_DEFAULTS. The model_name is hardcoded in the eval_infer.py files. Co-authored-by: openhands <openhands@all-hands.dev>
…nd swtbench eval_infer
Import EVAL_DEFAULTS and use parser.set_defaults() to apply them. model_name remains hardcoded in the argument parser. Co-authored-by: openhands <openhands@all-hands.dev>
…d_eval_env_images
Update image_utils.py, build_eval_env_images.py, and eval_infer.py to import and use INFER_DEFAULTS instead of hardcoding dataset and split values. Co-authored-by: openhands <openhands@all-hands.dev>
…EVAL_DEFAULTS
image_utils.py and build_eval_env_images.py are used for evaluation, so they should use EVAL_DEFAULTS (princeton-nlp/SWE-bench_Verified) not INFER_DEFAULTS (eth-sri/SWT-bench_Verified_bm25_27k_zsp). Added split='test' to EVAL_DEFAULTS to match values.yaml. Co-authored-by: openhands <openhands@all-hands.dev>
Revert AGENTS.md to main version. Restore original docstring example in build_images.py. Co-authored-by: openhands <openhands@all-hands.dev>
Set workspace default='remote' in args_parser.py since it's the same for all benchmarks. Remove workspace from all INFER_DEFAULTS in config.py files. Co-authored-by: openhands <openhands@all-hands.dev>
…ULTS
Set max_iterations default=500 in args_parser.py since it's the same for all benchmarks. Remove max_iterations from all INFER_DEFAULTS in config.py files. Co-authored-by: openhands <openhands@all-hands.dev>
Set critic default='finish_with_patch' in critics.py since it's the same for all benchmarks. Remove critic from all INFER_DEFAULTS in config.py files. Co-authored-by: openhands <openhands@all-hands.dev>
Revert benchmarks/utils/constants.py to main version. Hardcode './eval_outputs' as default for --output-dir in args_parser.py. Co-authored-by: openhands <openhands@all-hands.dev>
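A hedged sketch of how the shared parser could look after the last few commits; the flag names and default values follow the ones mentioned in this PR, while the function name and surrounding structure are assumptions.

```python
# Hypothetical benchmarks/utils/args_parser.py -- only the shared defaults are sketched here.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Defaults that are identical across all benchmarks live here...
    parser.add_argument("--workspace", default="remote")
    parser.add_argument("--max-iterations", type=int, default=500)
    parser.add_argument("--output-dir", default="./eval_outputs")
    parser.add_argument("--n-limit", type=int, default=0)
    # ...while the --critic default ('finish_with_patch') is set in critics.py, and
    # benchmark-specific values are layered on later via parser.set_defaults(**INFER_DEFAULTS).
    return parser
```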
@OpenHands Then clone the evaluation repository and simplify the code as much as possible. Now that the correct values are defined in the benchmarks repository, those values can be totally removed from the evaluation repository.

I'm on it! simonrosenberg can track my progress at all-hands.dev
Add level to GAIA INFER_DEFAULTS matching production configuration. Make --level argument optional since it now has a default. Co-authored-by: openhands <openhands@all-hands.dev>
@OpenHands Please validate that

I'm on it! simonrosenberg can track my progress at all-hands.dev
Summary

I've completed the task of simplifying the OpenHands/evaluation repository to leverage the default values now defined in the OpenHands/benchmarks repository.

Changes Made

Created PR: https://github.com/OpenHands/evaluation/pull/223 (draft)

Files Modified:
Benefits
Dependency

This PR depends on OpenHands/benchmarks PR #377, which aligns the default argument values in the benchmarks repository. The evaluation PR should be merged after the benchmarks PR is merged.
Code Review: PR #377 - Align Default Argument Values with Evaluation Repository

Taste Rating: 🟡 Acceptable - Works but has one critical gap

Hyperparameter Validation Summary

I compared the PR's config.py files against the evaluation repository's values.yaml.

✅ Values That Match Correctly
[CRITICAL ISSUES] - Must Fix

🔴 [benchmarks/gaia/config.py] Missing
@juanmichelini commit0 was set to max_retries = max_attempts = 1 because commit0 runs are extremely long and it would always rerun 3 times (because back then the finish_with_patch critic was incorrectly setting every result to false).
f27ac88 to d7be118
- Keep default=3 for max_attempts and max_retries in args_parser.py
- Remove redundant max_attempts=3 and max_retries=3 from config.py files (gaia, swebench, swebenchmultimodal, swtbench) since they match the default
- Keep max_attempts=1 and max_retries=1 in commit0/config.py since it differs from the default
- Remove max_retries from commit0/build_images.py set_defaults (uses global default)

Co-authored-by: openhands <openhands@all-hands.dev>
d7be118 to 19be07f
max_retries should always be 3 since it is meant to mitigate infra errors. max_attempts, on the other hand, can be set to 1.
@simonrosenberg why does commit0 only have 8 workers?
It was set like this at the start because 16 seemed too much to me. But 16 is actually nothing so it should be bumped to 16 |
Summary
This PR aligns the default argument values in the benchmarks repository with the values used in the evaluation repository (OpenHands/evaluation).

Changes

Global defaults in args_parser.py

These defaults are the same across all benchmarks and are set directly in args_parser.py:

| Argument | Default |
| --- | --- |
| --workspace | remote |
| --max-iterations | 500 |
| --critic | finish_with_patch |
| --output-dir | ./eval_outputs |
| --n-limit | 0 |

Benchmark-specific defaults via config.py and parser.set_defaults()

Each benchmark has a config.py file with INFER_DEFAULTS (and optionally EVAL_DEFAULTS) that are applied via parser.set_defaults():

| Benchmark | INFER_DEFAULTS |
| --- | --- |
| commit0 | dataset, split, repo_split, num_workers=8, max_attempts=1, max_retries=1 |
| gaia | dataset, split=validation, num_workers=30, max_attempts=3 |
| swebench | dataset, split=test, num_workers=30, max_attempts=3, max_retries=3 |
| swebenchmultimodal | dataset, split=dev, num_workers=30, max_attempts=3, max_retries=3 |
| swtbench | dataset (SWT-bench), split=test, num_workers=30, max_attempts=3, max_retries=3 |

Evaluation defaults (EVAL_DEFAULTS):

| Benchmark | EVAL_DEFAULTS |
| --- | --- |
| swebench | dataset, workers=12 |
| swebenchmultimodal | dataset, split=dev, workers=12 |
| swtbench | dataset (SWE-bench), split=test, workers=24 |

Note: swtbench uses different datasets for inference (SWT-bench) vs evaluation (SWE-bench). A short sketch of how these default layers interact is shown below.
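As a minimal, self-contained sketch of how the three layers interact (illustrative flag names and values, not the repository's actual parser): per-benchmark parser.set_defaults() overrides the shared add_argument defaults, and explicit CLI flags override both.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--max-iterations", type=int, default=500)  # shared default
parser.add_argument("--num-workers", type=int)

# Benchmark-specific override, as a config.py INFER_DEFAULTS entry would be applied.
parser.set_defaults(num_workers=8)

# An explicit CLI flag still wins over both kinds of defaults.
args = parser.parse_args(["--max-iterations", "100"])
print(args.max_iterations, args.num_workers)  # 100 8
```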
Benefits

Testing